One of the most fundamental problems in robotics is the Simultaneous Localization and Mapping (SLAM) problem.
This problem arises when the robot does not have access to a map of the environment and does not know its own pose. 
This knowledge is critical for robots to operate autonomously.
SLAM is an active research area in robotics.
A variety of solutions have been developed.
Most solutions rely on large and heavy sensors that have a high range and accuracy (e.g., SICK laser rangefinder).
However, these sensors cannot be used on small (flying) vehicles.
As a result, researchers focused on using vision sensors, which offer a good balance in terms of weight, accuracy and power consumption.
Lightweight cameras are especially attractive for small flying vehicles (MAVs).

Almost all publications related to this research describe a methodology that is able to learn the environment, but do not produce a visual map.
An exception is the research by Steder, which builds a visual map of the environment.


\section{Visual-map SLAM}

\subsubsection{Visual SLAM for flying vehicles}
\label{sec:realted-visual-slam-steder}
The approach presented in this thesis is inspired by Steder et al. \cite{steder2008visual}, who presented a system to learn large visual maps of the ground using flying vehicles.
The setup used is comparable to the AR.Drone, with an inertial sensor and a low-quality camera pointing downward.
If a stereo camera setup is available, their system is able to learn visual elevation maps of the ground.
If only one camera is carried by the vehicle, the system provides a visual map without elevation information.
Steder uses a graph-based formulation of the SLAM problem, in which the poses of the vehicle are described by the nodes of a graph.
Every time a new image is acquired, the current pose of the camera is computed based on both visual odometry and place revisiting.
The corresponding node is augmented to the graph.
Edges between these nodes represent spatial constraints between them.
The constructed graph serves as input to a TORO-based network optimizer (Section \ref{sec:prob-rob-toro}), which minimizes the error introduced by the constraints.
The graph is optimized if the computed poses are contradictory.

\begin{figure}[htb!]
  \begin{center}
    \subfigure[Map constructed by a blimp (non-rigid airship)]{\label{steder_blimp}\includegraphics[height=3.5cm]{images/steder_blimp.png}}
    \hspace{1cm}
    \subfigure[Map constucted by a stereo camera platform. A sensor platform was mounted on a rod to simulate a freely floating vehicle.]{\label{steder_outdoor}\includegraphics[height=3.5cm]{images/steder_outdoor.png}}
   
  \end{center}
  \caption{Visual maps obtained with the visual SLAM system presented in \cite{steder2008visual} (Courtesy Steder et al. \cite{steder2008visual}).}
  \label{featureImg}
\end{figure}

Each node models a 6 degree of freedom camera pose.
The spatial constraints between two poses are computed from the camera images and the inertia measurements.
%The camera images are used to estimate the relative motion of the camera.
To do so, visual features are extracted from the images obtained from down-looking cameras.
Steder uses Speeded-Up Robust Features (SURF) (Section \ref{sec:background-feature-extraction}) that are invariant with respect to rotation and scale.
By matching features in current image to the ones stored in the previous $n$ nodes, one can estimate the relative motion of the camera (Section \ref{sec:background-visual-odometry}).
The inertia sensor provides the roll and pitch angle of the camera.
This reduces the dimensionality of each pose that needs to be estimated from $\R^6$ to $\R^4$.
For each node, the observed features as well as their 3D positions relative to the node are stored.
The constraints between nodes are computed from the features associated with the nodes.

In the case of place revisiting, they compare the features of the current frame with all previous features.
To speed up this potentially expensive operation, multiple filters are used.
Firstly, only the features from robot poses that lie within Tipaldi's confidence interval \cite{tipaldi2007approximate} are used.
Secondly, only the best features from the current image are used (i.e., features with the lowest descriptor distance during visual odmetry).
Finally, a $k$-D tree is used to efficiently query for similar features, together with the best-bins-first technique proposed by Lowe \cite{lowe1999object}.

The camera pose is computed as follows.
Using known camera calibration parameters, the positions of the features are projected on a normalized image plane.
Now, the altitude of the camera is computed by exploiting the similarity of triangles.
Once the altitude is known, the yaw (Section \ref{sec:platform-quadrotor-flight-control}) of the camera is computed by projecting map features 
%(from previous camera images)
into the same normalized image plane.
When matching two features from the camera image against two features from the map, the yaw is the angle between the two  lines on this plane.
Finally, the feature positions from the camera image are projected into the map according to the known altitude and yaw angle.
The $x$ and $y$ coordinates are determined as the difference between the positions of the map features and the projections of the corresponding image points.

Both visual odometry and place revisiting return a set of correspondences, from which the the most likely camera transformation is computed.
First, these correspondences are ordered according to the Euclidean distance of their descriptor vectors, such that the best correspondences are used first.
The transformation $T_{c_a,c_b}$is determined for each correspondence pair.
This transformation is then evaluated based on the other features in both sets using a score function.
The score function calculates the relative displacement between the image features and the map features projected into the current camera image.
The solution with the highest score is used as estimated transformation.

Feature correspondences between images are selected using a deterministic PROSAC \cite{chum2005matching} algorithm.
PROSAC takes into account a quality measure (e.g., distance between feature descriptors) of the correspondences during sampling, where RANSAC draws the samples uniformly.

Steder's article describes how to estimate the motion of a vehicle and perform place revisiting using a feature map.
However, this article and Steder's other publications do not describe how the visual map is constructed and visualized.
Furthermore, it lacks how an elevation map is constructed and how this elevation map can be used to improve the position estimates (e.g., elevation constraints).
A great disadvantage of the described method is the computational cost of the optimization method, which cannot be performed online (i.e. during flight). This reduces the number of practical applications.


\subsubsection{Online Mosaicking}
\label{related-online-mosaicking}
While Steder uses a Euclidean distance measure to compare the correspondences of the features in two images, Caballero et al. \cite{caballero2009unmanned} indicate that this is a last resort.
They are able to make a robust estimation of the spatial relationship on different levels: homogeneous, affine and Euclidean (as described in Section \ref{sec:background-projective-geometry}).
The article addresses the problem for aerial vehicles with a single camera, like the AR.Drone.
%Geo-referenced mosaics can be sufficient as environment model for certain tasks.
A mosaic is built by aligning a set of images gathered to a common frame, while the aerial vehicle is moving.
A set of matches between two views can be used to estimate a homographic model for the apparent image motion.

\begin{figure}[htb]
\centering
\includegraphics[width=6cm]{images/Caballero_map.png}
\caption{Mosaic constructed with a KARMA autonomous airship flying at $22\small{m}$ altitude (Courtesy Caballero et al. \cite{caballero2009unmanned}).}
\label{fig:Caballero_map}
\end{figure}


The homography that relates two given images is computed from sets of matched features.
Depending on the scene characteristics (e.g., the parallax effect and small overlap between images), the homography computation could become a difficult problem. 
A classical solution to improve the results is to introduce additional constraints to reduce the number of degrees of freedom of the system of equations.
This is accomplished through a hierarchy of homographic models, ￼in which the complexity of the model to be fitted is decreased whenever the system of equations is ill-constrained.
An estimation of this accuracy will be given by the covariance matrix of the computed parameters.

If most of the matches are tracked successfully, a \textit{complete homogeneous transformation} (8 degrees of freedom) is estimated.
Least median of squares (LMedS) is used for outlier rejection and a M-Estimator  \cite{zhang1997parameter} to compute the final result.
When the number of correspondences is too low, an \textit{affine transformation} (6 degrees of freedom) is estimated.
LMedS is not used, given the reduction in the number of matches.
Instead, a relaxed M-Estimator (soft penalization) is carried out to compute the model.
Only when the features in the image are really noisy a \textit{Euclidean transformation} (4 degrees of freedom) is estimated.
The model is computed using least-squares.
If the selected hierarchy level is not constrained enough (e.g., the M-Estimator diverges by reaching the maximum number of iterations), the algorithm decreases the model complexity.
Once the homography is computed, it is necessary to obtain a measure of the estimation accuracy.
Caballero uses a $9 \times 9$ covariance matrix of the homography matrix, which is explained in \cite{Hartley2004}.

The motion of the vehicle is computed using the estimated homographies.
This method assumes that the terrain is approximately flat and that cameras are calibrated.
$H_{12}$ is the homography that relates the first and the second view of the planar scene.
Both projections can be related to the camera motion as:
\begin{equation}
H_{12} = AR_{12}(I - \frac{t_2n_i^T}{d_1})A^{-1}
\end{equation}
where $A$ is the camera calibration matrix, $t_2$ is the relative translation of the second view, $n_1$ is an unitary vector normal to the plane in the first camera coordinate frame, $d_1$ is the distance from the first camera to the plane and $R_{12}$ is the rotation matrix that transforms a vector in the first camera coordinate frame into a vector expressed in the second camera coordinate frame.
If the camera calibration matrix $A$ and the distance to the ground $d_1$ are known, it is possible to extract the motion of the vehicle.
Triggs \cite{triggs1998autocalibration} proposes a robust algorithm based on the singular value decomposition of the calibrated homography, defined by:
\begin{equation}
H_{12}^{u} = A^{-1} H_{12} A
\end{equation}

A classical static mosaic is not appropriate, as the uncertainties should be taken into account along the mosaicking process.
Therefore, the mosaic is augmented by including stochastic information, resulting in a set of images linked by stochastic relations.

The position of the image inside the mosaic is obtained by multiplying the current homography by all the previous homographies.
This estimation will drift along time, but can be compensated by building a mosaic.
By comparing the current image with images previously stored in the mosaic, loop-closures can be detected and used to eliminate the accumulated drift.
The crossover detection consists of finding one image whose Mahalanobis distance is within a certain empirical range.
Once an image is detected, a feature matching is used to compute the alignment between both images. 
In the general case, the task of matching images taken from very different viewpoints is difficult and computationally costly.
For this reason, the image is warped to match the image from the mosaic, which simplifies the problem.
The computed homography is used to obtain the correct alignment.
Now, the relations among the most recent image and the corresponding image from the mosaic can be updated.

The relations between images are maintained and updated by using an Extended Kalman Filter (Section \ref{sec:background-solution-techniques}).
The state vector including the mean and covariance of all images is defined as:
\begin{equation}
x^{-} = [x_1, x_2, ..., x_n]^T = [h_{01}, x_1 \cdot h_{12}, ..., x_{(n-1)n} \cdot h_{(n-1)n}]^T
\end{equation}
The correction obtained when a loop-closing is detected is used to update the state.
All states affected by the correction are now updated by the Extended Kalman Filter.

Similar to Steder's research, Caballero uses a probabilistic representation of the estimated poses.
Both approaches optimize the map after a loop-closure event is detected, but use different optimization methods.
Unfortunately, these optimization methods are computationally expensive and difficult to perform while flying.
Steder exploits the information from the inertia sensor to reduce the complexity of the pose estimation problem, without decreasing the degree of freedom of the estimated pose.



\section{Airborne elevation mapping with an ultrasound sensor}
\label{sec:related-research-elevation-mapping}
Multiple techniques have been presented that successfully employ an airborne radar sensor for building elevation maps (e.g., \cite{foessel2000radar, weiß2006airborne}).
Due to its large footprint, a radar sensor cannot be mounted on small UAVs.
Unlike a radar sensor, an ultrasound sensor can be carried by a small UAV.
This sensor provides feedback for altitude stabilization and obstacle detection.
According to my knowledge, no publication addresses the problem of elevation mapping using a single airborne ultrasound sensor.
However, the problem is addressed using advanced (e.g., multibeam) ultrasound sensors.
These advanced ultrasound sensors are commonly used for obstacle detection and navigation by underwater robots.
%are popular for underwater navigation \cite{majumder2001multisensor}.
Underwater robots observe the seabed from above, very similar to the way airborne vehicles observe the floor from above.
This similarity suggests that methods developed for underwater mapping can be used for airborne mapping.

\subsubsection{Seabed mapping}
Seabed mapping is a research area in which ultrasound sensors have been used for building elevation maps. Some examples are \cite{johnson1996seafloor,strauss1999multibeam,zerr1996three,evans2002three}.
Kenny et al. \cite{kenny2003overview} give an overview of seabed-mapping technologies.
Elevation mapping is essential for an autonomous underwater vehicle designed to navigate close to the seabed.
The map is being used for obstacle avoidance, path planning and localization.

Given the limited range and applicability of visual imaging systems in an underwater environment,
sonar has been the preferred solution \cite{blondel1997handbook} for observing the seabed.
Most approaches use a side scan sonar system that returns an image instead of a single distance measurements.
Side scan uses a sonar device that emits conical or fan-shaped pulses down toward the seabed.
The intensity of the acoustic reflections are recorded in multiple transducers.
When stitched together along the direction of motion, these slices form an image of the sea bottom within the coverage of the beam.
Modern side scan devices offer high-resolution images of the seabed on which objects of at least $10\begin{small}cm\end{small}$ may be detected at a range of up to $100\begin{small}m\end{small}$.

\begin{figure}[htb]
\centering
\includegraphics[width=7.5cm]{images/williams_seabed_map.jpg}
\caption{Seabed terrain model built by projecting the texture of the visual images onto a surface model generated by the sonar data.
The grey areas represent portions of the terrain that are only observed by the sonar. (Courtesy Williams and Mahon \cite{williams2004simultaneous}).}
\label{fig:williams_seabed_map}
\end{figure}

A particularly interesting application of sonar-based elevation mapping is presented in \cite{williams2004simultaneous}.
The article presents a SLAM system that uses information from the on-board sonar and vision system of an Unmanned Underwater Vehicle (UUV).
The system estimates the vehicle's motion and generates a model of the structure of the underlying reefs.
%This work shows correspondences to the methods used in this thesis.
Unlike the AR.Drone, the pencil beam scanning sonar of the UUV returns high-resolution images instead of a single distance measurement.

In cases where a priori maps are not available, a mechanism must exist for building a map while simultaneously using that map for
%the purposes of
localization.
During motion, the vehicle builds a complete map of landmarks and uses these to provide continuous estimates of the vehicle location.
By tracking the relative position between the vehicle and identifiable features in the environment, both the position of the vehicle and the position of the features can be estimated simultaneously.
An Extended Kalman Filter is used for stochastic estimation of the vehicle's position and map feature locations.
The state is represented by an augmented state vector $\hat{x}^{+}(t)$, consisting of $n_v$ states.
\begin{equation}
\hat{x}^+ = \left[ {
\begin{array}{c} \hat{x}_0^+ \\ \vdots \\ \hat{x}_{n_v}^+ \\  \end{array}
} \right]
\end{equation}
Each state $\hat{x}_t^+$ consists of a vehicle state $\hat{x}_v$ and the observed features $\hat{x}_{f, i}, i = 0, ..., n_f$, where $n_f$ is the number of features.
\begin{equation}
\hat{x}_t^+ = \left[ {
\begin{array}{c} \hat{x}_v \\ \hat{x}_{f, 0} \\ \vdots \\ \hat{x}_{f, n_f} \\  \end{array}
} \right]
\end{equation}
The covariance matrix for the state at time $t$ is defined as:
\begin{equation}
\textbf{P}^+_t = E[(x_t - \hat{x}^+_t)(x_t - \hat{x}^+_t)^T | \textbf{Z}^k]
\end{equation}
It was assumed that the roll and pitch of the vehicle were negligible, based on observations of the vehicle performance during in-water tests and results from the design of the vehicle.
The vehicle pose at time step $t$ is represented by
\begin{equation}
\hat{x}^+_v(t) = \left[ \begin{array}{cccc} \hat{\textit{x}}^+_v(t) & \hat{\textit{y}}^+_v(t) & \hat{\textit{z}}^+_v(t) & \hat{\psi}^+_v(t) \end{array}\right]^T
\end{equation}
where $\hat{\textit{x}}^+_v(t)$, $\hat{\textit{y}}^+_v(t)$ and $\hat{\textit{z}}^+_v(t)$ are the $x$, $y$ and $z$ coordinates of the vehicle at time $t$ and $\hat{\psi}^+_v(t)$ is the vehicle's yaw at time $t$.
The features states are represented by
\begin{equation}
\hat{x}^+_i(t) = \left[ \begin{array}{ccc} \hat{\textit{x}}^+_i(t) & \hat{\textit{y}}^+_i(t) & \hat{\textit{z}}^+_i(t) \end{array}\right]^T
\end{equation}
where $\hat{\textit{x}}^+_i(t)$, $\hat{\textit{y}}^+_i(t)$ and $\hat{\textit{z}}^+_i(t)$ are the $x$, $y$ and $z$ coordinates of a feature.

Ultrasound and vision information can be combined to aid in the identification and classification of natural features present in the environment.
For maximum overlap of both sensors the ultrasound sensor is mounted directly above the high-resolution vision system.
The ultrasound sensor scans the seabed directly below the vehicle and is used to generate
profiles of the surface.
The distance measurements received by the ultrasound sensor are then processed to identify distinctive shapes which are used to initialize new features in the SLAM map.
The observation of range $R$ and bearing $\theta$ are combined with the estimated value of the vehicle pose $\hat{x}_v(t)$ and the measured position
$\left[ \begin{array}{ccc} \textit{x}_s & \textit{y}_s & \textit{z}_s \end{array}\right]^T$ of the sonar relative to the camera frame.
\begin{equation}
\left[ {
\begin{array}{c} \hat{\textit{x}}_i \\ \hat{\textit{y}}_i \\ \hat{\textit{z}}_i \end{array}
} \right] =
\left[ {
\begin{array}{c}
	\hat{\textit{x}}_v + \textit{x}_s cos \hat{\psi}_v - (\textit{y}_s + R cos \theta) sin \hat{\psi}_v \\
	\hat{\textit{y}}_v + \textit{x}_s sin \hat{\psi}_v + (\textit{y}_s + R cos \theta) cos \hat{\psi}_v \\
	\hat{\textit{z}} + \textit{z}_s + R sin \theta
\end{array}
} \right]
\end{equation}

Once the ultrasound features have been initialized, they are tracked using observations from the vision system.
When a new feature is initialized, the sonar footprint is projected into the visual frame of the camera.
The center of this footprint is then used to identify a high contrast feature in the image within the area covered by the sonar.
These visual features are tracked from frame to frame using the Lucas and Kanade feature tracking technique \cite{lucas1998application}.
Observation of elevation $z_e$ and azimuth $z_a$ (orientation) are provided to the SLAM algorithm.
The estimated elevation and azimuth are computed using the current estimate of the vehicle pose and feature position:
\begin{equation}
\left[ {
\begin{array}{c} \hat{z}_e \\ \hat{z}_a \end{array}
} \right] =
\left[ {
\begin{array}{c}
	(\hat{x}_i - \hat{x}_v) cos \hat{\psi}_v + (\hat{y}_i - \hat{y}_v) sin \hat{\psi}_v \\
	-(\hat{x}_i - \hat{x}_v) sin \hat{\psi}_v + (\hat{y}_i - \hat{y}_v) cos \hat{\psi}_v
\end{array}
} \right]
\end{equation}
The difference between the actual observation received from the camera and the predicted observation is the innovation, which is used in computing the posterior state estimate.
The current implementation makes use of very simple features.
These features are sufficient for tracking purposes, but are not applicable to detect loop-closures.


\section{Research based on the AR.Drone}
Because the AR.Drone is a quite recent development, the number of studies based on this platform is limited.

\subsubsection{Autonomous corridor and staircase flights}

A recent publication is from Cornell University \cite{Bills2011icra}, where an AR.Drone is used to automatically navigate corridors and staircases based on visual clues.
Their method classifies the type of indoor environment (e.g., corridors, staircases, rooms and corners) and then uses vision algorithms based on perspective clues to estimate the desired direction to fly.
This method requires little computational power and can be used directly without building a 3D model of the environment.

\begin{figure}[htb!]
  \begin{center}
    \subfigure[(Top) Original corridor image. (Bottom) Corridor processed with Canny edge detector and Hough transform to find the vanishing point (blue circle).]{\label{fig:cornell-corridor}\includegraphics[height=6cm]{images/ardrone-corridor.jpg}}
\hspace{1cm}
    \subfigure[(Top) Original image of staircase. (Bottom) Processed image where the bold red line marks the location of the staircase.]{\label{fig:cornell-staircase}\includegraphics[height=6cm]{images/ardrone-staircase.jpg}}
   
  \end{center}
  \caption{Autonomous corridor and staircase flights based on line detection (Courtesy Bills et al. \cite{Bills2011icra}).}
  \label{featureImg}
\end{figure}

The type of environment is determined using two algorithms.
The first method uses GIST features \cite{oliva2001modeling} (Section \ref{sec:background-feature-extraction}) with a Support Vector Machine (SVM) learning algorithm for classification.
GIST features are well suited for this task because they directly measure the global distribution of oriented line segments in an image, which takes advantage of the long lines found in indoor images.
The second method computes a confidence estimate from both the staircase and corridor vision algorithms.
The confidence values are compared and the highest is used to select the environment type.

When navigating corridors, parallel lines appear to converge.
The points where lines appear to converge are called vanishing points.
Vanishing points are used to locate the end of the corridor. 
The Canny edge detector \cite{canny1986computational} is used to detected edges and a probabilistic Hough transform \cite{kiryati1991probabilistic} is used to find lines.
The vanishing points (end of the corridor) are found by searching for the image region that has the highest density of pair-wise line intersections.
A probabilistic model with Gaussians is constructed to model the noise in the location of a vanishing point.
Staircases can be navigated by flying to the center of the staircase. %
The AR.Drone's onboard intelligence will automatically stabilize the altitude.
%, so the AR.Drone can follow the staircase using its onboard intelligence (i.e., altitude stabilization).
Lines that represent the staircase are found by classifying the line segments as horizontal or vertical and looking for the largest horizontal-line cluster in the Hough transform space.
Once the cluster of horizontal lines is detected, the mean of the lines' endpoints is calculated to retrieve the desired direction.

\subsubsection{Map\&replay bearing-only navigation}
Krajn{\'\i}k et al. \cite{krajník2010simple,faiglsurveillance,krajník2011ar} present a simple monocular navigation system based on the map\&replay technique.
For this technique, a robot is navigated by a human through an environment and creates a map of its environment.
After the map is created, the robot starts at the start position and repeats the learned path.
The method can navigate a robot along a given path while observing only one landmark at a time, making it more robust than other monocular approaches.

The authors simplify the problem by splitting the route in straight line segments (i.e., the robot travels in straight lines and rotates between segments).
%At first, the robot is tele-operated in the environment along straight line segments in the mapping phase.
For each segment a set of visual landmarks is remembered, consisting of salient features extracted from the frontal camera image.
Similar to Steder, the authors use Speeded-Up Robust Features (SURF) to detect and represent salient features.
When entering a segment, the robot is rotated to match the rotation during the mapping phase.
During the navigation along a segment, the robot establishes correspondences of the currently seen and previously mapped landmarks and computes differences in the expected and recognized positions for each such correspondence.
The robot steers in a direction that reduces those differences while moving straight at a constant speed until its odometry indicates that the current segment has been traversed.
The heading is determined by a histogram voting procedure.
%This navigation method is provably stable for non-degenerate polygonal paths, since the heading corrections can suppress the position uncertainty originating from the dead reckoning.
%The distance traveled along a segment is estimated purely by dead reckoning.
%Due to the low precision of the IMU based distance estimation, the localization error is high.
%However, the localization error does not diverge and is kept within sufficient limits allowing the drone to autonomously navigate along the learned path.
